Andreas Ammer, VRVis, ammer@vrvis.at
Denis Gracanin, Virginia Tech, gracanin@vt.edu
Zoltan Konyha, VRVis, konyha@vrvis.at
Kresimir Matkovic, VRVis, matkovic@vrvis.at
Cagatay Turkay, University of Bergen, Cagatay.Turkay@ii.uib.no
[PRIMARY contact]
We used our interactive, multiple linked views
visualization application ComVis in our analysis. ComVis can visualize scalar and
categorical data in several different views. Each view is
interactive and brushable. Brushes
defined in the same view or in different views can be combined using Boolean
operators (composite brushing).
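The idea behind composite brushing can be sketched in a few lines of Python. This is only an illustration, not ComVis code: it assumes each brush is represented by the set of record ids it selects, and the function name, the operator labels and the sample ids are all hypothetical.

```python
def combine_brushes(left, right, op):
    """Combine two brushes (sets of selected record ids) with a Boolean operator.

    Assumption: a brush is modelled as a plain set of ids; ComVis itself
    may represent selections differently.
    """
    if op == "AND":
        return left & right   # records selected by both brushes
    if op == "OR":
        return left | right   # records selected by either brush
    if op == "DIFF":
        return left - right   # records in the first brush but not the second
    raise ValueError(f"unknown brush operator: {op}")


# Hypothetical selections made in three different views.
intelligence = {1, 2, 3}
telephone = {3, 4}
places_and_people = {2, 3, 4, 9}

selected = combine_brushes(intelligence, telephone, "OR")
selected = combine_brushes(selected, places_and_people, "AND")
```

Chaining calls in this way mirrors how a sequence of brushes narrows or widens the current selection.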
We used Python scripts to:
- Split each document in the dataset into sub-reports in order to have more
accurate relation mining
- Categorize words in our word-document matrix
- Automatically create a node-edge structure from the selected words and
sub-reports to create files suitable for our graph drawing tool
We used the R statistical computing system (http://www.r-project.org/) for text
mining, specifically the text mining package tm (http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf).
This package was used to:
- Remove stop-words (e.g. a, the, and)
- Remove punctuation
- Remove suffixes (stemming)
- Provide a term-document matrix (200 terms x 5 documents) which contains the most frequent
terms in the documents along with their frequencies
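For readers who want to see the shape of this preprocessing, the following is a minimal Python sketch of the same kind of pipeline. It is not the R/tm code we ran: the stop-word list is a tiny hypothetical stand-in, stemming is left out, and the function names and sample sentences are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; tm ships a much longer one.
STOP_WORDS = {"a", "the", "and", "of", "to", "in", "is"}


def tokenize(text):
    """Lower-case the text, strip punctuation and drop stop-words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def term_document_matrix(documents, top_n=200):
    """Return {term: [count in doc 0, count in doc 1, ...]} for the top_n terms."""
    counts = [Counter(tokenize(text)) for text in documents]
    totals = Counter()
    for doc_counts in counts:
        totals.update(doc_counts)
    top_terms = [term for term, _ in totals.most_common(top_n)]
    # Counter returns 0 for terms that do not occur in a document.
    return {term: [doc_counts[term] for doc_counts in counts] for term in top_terms}


# Made-up sample text, not taken from the challenge data.
docs = [
    "The guns arrived in Karachi.",
    "Guns and money; money transfers to Karachi.",
]
matrix = term_document_matrix(docs, top_n=3)
```

Each row of the resulting matrix holds one term's frequency per document, which is the structure the later views operate on.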
We used the Tulip graph visualization software (http://tulip.labri.fr) to visualize the graph structure we built
automatically with Python scripts. Tulip enables us to visualize the
graph structure with different layouts and use a variety of visual properties
to better analyze the relational data. In addition, this tool provides
different forms of interaction with the graph.
Video:
ANSWERS:
MC1.1: Summarize the activities that happened in each country with respect
to illegal arms deals based on a synthesis of the information from the
different report types and sources.
State the situation in each country at the end of the period (i.e. the
end of the information you have been given) with respect to illegal arms deals
being pursued. Present a hypothesis
about the next activities you expect to take place, with respect to the people,
groups, and countries.
The text mining package provides us with a
term-document matrix which contains the frequencies of the most frequent terms
per document, plus each term’s total frequency. The preprocessing steps
(mentioned in the Tools section) enabled us to filter out unnecessary
terms (e.g. stop words) and to make the frequency counts more meaningful (by
stemming and removing punctuation). Each term is classified into one of these five
categories: 1) General Term, 2) City, 3) Country, 4) People & Organization, 5) Month.
First, each word is classified
as a "General Term". Next, the word list is checked
against the lists of country names (~246 countries), city names (~37000 cities)
and month names (12 months) and reclassified on a match. We marked the
"People & Organization" names manually because there were only a
limited number of such words.
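The classification step can be sketched as follows. This is a simplified illustration rather than our actual script: the function name, the precedence between the lists and the tiny sample lists are assumptions.

```python
def classify_terms(terms, countries, cities, months, people_orgs):
    """Assign each term one of the five categories.

    Assumption: lookups are case-insensitive and the manually curated
    "People & Organization" list overrides the automatic lookups.
    """
    categories = {}
    for term in terms:
        key = term.lower()
        category = "General Term"  # default class for every word
        if key in countries:
            category = "Country"
        elif key in cities:
            category = "City"
        elif key in months:
            category = "Month"
        if key in people_orgs:
            category = "People & Organization"
        categories[term] = category
    return categories


# Tiny hypothetical lookup lists; the real lists are far longer.
countries = {"pakistan", "yemen"}
cities = {"karachi", "moscow"}
months = {"january", "february"}
people_orgs = {"bukhari"}

labels = classify_terms(
    ["Pakistan", "Karachi", "February", "Bukhari", "parts"],
    countries, cities, months, people_orgs,
)
```

Keeping the lookup lists as sets makes each membership test cheap even for the city list with tens of thousands of entries.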
After the classification phase, the frequency
lists are exported to a ComVis-compatible format. In ComVis, we employed:
i) "Term frequency vs term" scatter plots involving the whole dataset.
ii) A histogram to show the distribution of terms over the categories and to brush
certain classes of terms.
iii) A parallel coordinate plot where each axis represents a document and each
line is a term, positioned by its frequency in each document. This
visualization enables us to see how terms are distributed over the different sources of
information and to limit our analyses to certain types of documents.
iv) A tag cloud to highlight the terms that are brushed.
Using basic linking &
brushing, we can select subsets of words which come from certain
frequency ranges and categories. Being able to see the frequencies over
different document types enables the analyst to see how often certain actors or
places are mentioned in different sources of information. The most frequent general
terms gave us no clue about the story. However, when we brushed a
certain frequency range, we ended up with more interesting terms such as “parts,
Carabobo, vwhombre, vwparts4salecheep, burj”. A comparative screenshot
can be seen in Figure 1.1.
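Outside ComVis, the effect of brushing a category together with a frequency range can be approximated with a simple filter. The sketch below is illustrative only: the function name is hypothetical and the frequencies are invented, not taken from our term-document matrix.

```python
def brush_terms(rows, category=None, min_freq=0, max_freq=float("inf")):
    """Return the terms whose category and total frequency fall inside the brush.

    rows is a list of (term, category, total_frequency) tuples.
    """
    return [
        term
        for term, term_category, freq in rows
        if (category is None or term_category == category)
        and min_freq <= freq <= max_freq
    ]


# Invented frequencies, for illustration only.
rows = [
    ("report", "General Term", 120),
    ("parts", "General Term", 14),
    ("burj", "General Term", 9),
    ("Karachi", "City", 30),
]

interesting = brush_terms(rows, category="General Term", min_freq=5, max_freq=20)
```

Moving the frequency window while keeping the category fixed corresponds to dragging the brush along the frequency axis of the scatter plot.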
Figure 1.1 – The first column visualizes
term classifications, the second column visualizes term frequencies and the third
column visualizes brushed terms. In both rows, “general term” subsets
from different frequency ranges are brushed. In the top row we see that
the most frequent terms are not very interesting. However, the brushed subset in
the bottom row contains more interesting terms.
To start our
country-wise analysis, we first brushed country and place names and steered our
analysis with respect to the resulting place list (Figure
1.2). The parallel coordinate plot shows that while telephone and newspaper
records contain a large variety of place names with low frequency, intelligence
reports contain fewer place names with higher frequency.
Figure 1.2 – City and country names
brushed. In addition to the first three views, the parallel coordinate plot shows
distribution of terms over documents.
Using the prominent words in
documents, we ended up with the following country-wise analysis results:
Most of the information
regarding Pakistan and Gaza is derived from intelligence reports and
telephone intercepts. Figure 1.3 displays most of the prominent actors and
places in these documents.
Pakistan – In Karachi the organization named
Lashkar-e-Jhangvi is quite active. Bukhari and Bhutani are key people in this organization.
They meet at apartments and mosques. They possibly ordered guns (possibly
arrived in black boxes) and there were a lot of money transfers from
Bukhari's account.
Gaza - Kasem, Khouri and Anka are the key people in
Gaza. They are preparing for a mission and frequently talk on the phone.
Figure 1.3 - Result of a composite brushing sequence: 1) brush intelligence reports,
2) add telephone reports with an OR brush, 3) use a diff brush over the histogram to select only place and people names
in these documents.
Russia - There are a number of gun dealers active in
Moscow: Nicolai Kuryakin, Boonmee Khemkhaengare and Dombrowski. They do business
with contacts in Nigeria and Yemen and are planning a meeting in Dubai with people from Pakistan,
Gaza, Turkey, Kenya and Venezuela. They made a deal with the Yemeni Salah Ahmed
and received diamonds for gun shipments.
Turkey & Syria - There is a joint effort from people in
Turkey and Syria to organize activities. They are looking for supplies and will
join the meeting in Dubai with Russian gun dealers.
Nigeria - People in Nigeria are trying to make deals with the Russian
gun dealers. This relation and the important actors were discovered by analyzing
the email reports with a special interest in terms like “Mikhail,
boyohotmailcom, joetomskauru”, which we discovered as a result of the analysis
in Figure 1.4.
Kenya – Thabiti Otieno and Nahid Owiti are
suppliers in the illegal arms trade. Otieno is working with the Russians (in a
telephone conversation, Thabiti mentioned Nahid’s name) and selling arms to Sudan. They
were planning to join the meeting in Dubai.
Figure 1.4 – Reports
on email transactions were analyzed with a special interest in the terms in the tag
cloud display.
North Korea – Russian gun dealers buy guns from North
Korea. A plane was intercepted in Thailand carrying guns from North Korea to Kiev.
Yemen - Saleh Ahmed is an arms dealer in Yemen and he
is trying to purchase guns from the Russians using his contact with Dombrowski.
Hypothesis for future activities: After the meeting in Dubai, all these
different organizations in Pakistan, Gaza, Turkey, Syria and Yemen will
probably make deals with the Russian gun dealers. We can expect more
money transfers from these people to dealers in Russia. After the shipments are
made, armed organizations will most likely organize terrorist activities in the near
future in Pakistan, Gaza, Turkey and Syria. We can expect a joint activity
between people in Syria and Turkey, as they had frequent communications on the
phone. The organization named Lashkar-e-Jhangvi will most likely organize an event
in Karachi. Kasem, Anka and Khouri will organize an armed event in Gaza.
MC1.2: Illustrate the associations among the players in the arms
dealing through a social network. If
there are linkages among countries, please highlight these as well in the
social network. Our analysts are
interested in seeing different views of the social network that might help them
in counterintelligence activities (people, places, activities, communication
patterns that are key to the network).
All the documents provided in this challenge
consisted of smaller reports. Prior to our analyses, we split the
documents into separate reports. To build relation graphs, we used
the following procedure:
- The user selects a subset of the terms using ComVis and exports this
subset.
- Each pair of terms in this subset is checked for a
relation. Two terms are considered to be related if they appear in the same
report.
- The relations are converted into a relational graph, which is then
visualized in Tulip.
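The co-occurrence rule in the procedure above can be sketched in Python as follows. This is a simplified illustration, not our production script: the function name and sample data are hypothetical, each sub-report is assumed to be already reduced to a set of terms, and counting co-occurrences as an edge weight is an extra assumption (the rule itself only needs a yes/no relation). Writing the result to a Tulip input file is left out.

```python
from itertools import combinations


def build_cooccurrence_edges(selected_terms, reports):
    """Relate two selected terms if they appear in the same sub-report.

    reports is a list of sets, one set of terms per sub-report.
    Returns {(term_a, term_b): number of shared sub-reports}.
    """
    edges = {}
    for report_terms in reports:
        # Sort so that every pair is always stored in the same order.
        present = sorted(selected_terms & report_terms)
        for a, b in combinations(present, 2):
            edges[(a, b)] = edges.get((a, b), 0) + 1
    return edges


# Hypothetical sub-reports, each reduced to its set of terms.
reports = [
    {"bukhari", "karachi", "guns"},
    {"bukhari", "bhutani", "karachi"},
    {"kasem", "gaza"},
]
selected = {"bukhari", "bhutani", "karachi", "kasem", "gaza"}

edges = build_cooccurrence_edges(selected, reports)
```

The keys of the returned dictionary form the edge list and the selected terms that occur in at least one key form the node list.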
The nodes in the graph are color coded with
respect to their categories as seen in Figure 2.1.
Figure 2.1 – Categories
and their respective node colors.
This automatic graph
building mechanism helped us to discover many relations by selecting different
subsets. Using all terms resulted in a
cluttered graph which is very hard to interpret. Therefore, we employed smaller
relational graphs which are constructed from subsets. This automatic approach
has some drawbacks:
- Some people’s first names and surnames appear as separate nodes, e.g.
the first name and surname of Boonmee Khemkhaengare are displayed as two nodes in the graph. This
should be taken into consideration by the analyst.
- Some common names like Ali, Mohammed etc. can result in some artificial
links in the graph, but these nodes and their respective edges can be removed
manually by the analyst.
- Typos can result in duplicate nodes in the graph, e.g. we have two nodes
for “Kasam” and “Kasem” referring to the same person. These duplications should be
removed manually by the analyst. In Figure 2.3 this problem is annotated with
selection 1.
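We removed such duplicates by hand. As a possible aid, candidate duplicates could be flagged automatically before the analyst reviews them. The sketch below was not part of our workflow: it uses Python's standard difflib module, and the function name and the 0.75 similarity threshold are arbitrary assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(names, threshold=0.75):
    """Return pairs of node names that are similar enough to be likely typos.

    The threshold is an arbitrary assumption; the analyst still has to
    confirm every flagged pair before merging nodes.
    """
    pairs = []
    for a, b in combinations(sorted(names), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


candidates = near_duplicates({"Kasem", "Kasam", "Bukhari", "Anka"})
```

A lower threshold flags more pairs at the cost of more false alarms, so the value would need tuning on the actual node list.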
A subset of the terms
consisting of only “People & Organization” and “Place” names is displayed
in Figure 2.2. In this figure we can see the following relations:
Figure 2.2 – Graph to
depict the relations between people and places. The selected sub-graphs
indicate separate groups of relations
In Figure 2.3, we can see the above relation graph without the place
names. This resulted in a clearer image of the overall relations. However, certain relations are not depicted due
to the lack of spatial information, e.g. the relation in sub-graph 6 in Figure 2.2
is not observable in Figure 2.3.
Figure 2.3 – Graph depicting
only people and organization names. This graph is the result of brushing only
“People and Organization” terms in ComVis. A graph construction error due to typos
in the dataset is indicated with label 1.
The relations can be analyzed in more detail by
adding nodes for general terms. In Figure 2.4, the graph consists of all
terms mentioned in telephone intercept reports.
We can clearly see the actors who communicate via telephone.
Figure 2.4 – Graph
to depict the relations made over telephone.
By utilizing a subset involving e-mail and message
board reports, we ended up with the structure in Figure 2.5. We discovered the
following information:
Figure 2.5 – Graph
to depict mail and message board relations